Reinforcement learning assisted oxygen therapy for COVID-19 patients under intensive care

Background Patients with severe Coronavirus disease 19 (COVID-19) typically require supplemental oxygen as an essential treatment. We developed a machine learning algorithm, based on deep Reinforcement Learning (RL), for continuous management of oxygen flow rate for critically ill patients under intensive care, which can identify the optimal personalized oxygen flow rate with strong potentials to reduce mortality rate relative to the current clinical practice. Methods We modeled the oxygen flow trajectory of COVID-19 patients and their health outcomes as a Markov decision process. Based on individual patient characteristics and health status, an optimal oxygen control policy is learned by using deep deterministic policy gradient (DDPG) and real-time recommends the oxygen flow rate to reduce the mortality rate. We assessed the performance of proposed methods through cross validation by using a retrospective cohort of 1372 critically ill patients with COVID-19 from New York University Langone Health ambulatory care with electronic health records from April 2020 to January 2021. Results The mean mortality rate under the RL algorithm is lower than the standard of care by 2.57% (95% CI: 2.08–3.06) reduction (P < 0.001) from 7.94% under the standard of care to 5.37% under our proposed algorithm. The averaged recommended oxygen flow rate is 1.28 L/min (95% CI: 1.14–1.42) lower than the rate delivered to patients. Thus, the RL algorithm could potentially lead to better intensive care treatment that can reduce the mortality rate, while saving the oxygen scarce resources. It can reduce the oxygen shortage issue and improve public health during the COVID-19 pandemic. Conclusions A personalized reinforcement learning oxygen flow control algorithm for COVID-19 patients under intensive care showed a substantial reduction in 7-day mortality rate as compared to the standard of care. In the overall cross validation cohort independent of the training data, mortality was lowest in patients for whom intensivists’ actual flow rate matched the RL decisions. Supplementary Information The online version contains supplementary material available at 10.1186/s12911-021-01712-6.

patients. The therapy of COVID-19 is guided by the knowledge and experience of moderate-to-severe ARDS treatment [1]. Oxygen therapy is recommended as the first-line therapy of COVID-19-induced respiratory and hypoxia by the Centers for Disease Control and Prevention (CDC) and the World Health Organization (WHO). Oxygen therapy consists of different kinds of supplemental oxygen therapies including nasal cannula, simple mask, venturi mask, non-rebreather masks, and high flow oxygen systems. The key factor in different supplemental oxygen methods is the setting of different levels of oxygen flow rates [2]. Thus, the selection of appropriate oxygen flow rate is a crucial decision in COVID-19 treatment.
To improve the treatment efficiency, the administration of oxygen therapy should be determined by the severity of COVID-19-induced respiratory failure, incorporating the uncertainties in measurements of patient health status and prediction of individual outcomes to the oxygen decisions. It certainly requires a comprehensive investigation of the optimal and personalized oxygen flow rate.
Our research aims to explore effective oxygen therapy for COVID-19 patients based on continuous respiratory support and vital signs monitoring. A large collection of artificial intelligence (AI) and deep learning (DL) approaches have been proposed to accelerate the drug discovery and the process of diagnosis and treatment of COVID-19 disease [3,4]. Clinical studies in oxygen therapy and respiratory support have been made in a short period in the treatment of COVID-19 pneumonia [5,6]. However, respiratory failure remains the leading cause of death (69.5%) for SARS-CoV-2 [7]. Thus, we provide an AI algorithm for the oxygen flow control, based on the deep deterministic policy gradient (DDPG) [8], a widely used reinforcement learning (RL) method for continuous state and action spaces. DDPG uses off-policy data and the Bellman equation to learn the Q-function and then utilizes the resulted Q-function (critic network) to learn a deterministic policy (actor network). To stabilize the training, it considers slow-learning target networks, i.e., actor/critic target networks are updated slowly, hence keeping the estimated targets stable. The optimized policy can recommend personalized optimal oxygen flow rates for COVID-19 patients based on the knowledge of patient health status estimated from patients' electronic health records (EHRs).
Reinforcement learning has been successfully applied in the past to different healthcare problems such as multimorbidity management [9], HIV therapy [10], cancer treatment [11], and anemia treatment in hemodialysis patients [12]. For critical care, given the large amount and granular nature of electronically recorded data, RL is well suited for providing sequential optimal treatment recommendations and improving health outcomes for new ICU patients [13]. Recent studies include treatment strategies for sepsis in intensive care [14] and personalized regime of sedation dosage and ventilator support for patients in Intensive Care Units (ICUs) [15].
Focusing on RL-based oxygen flow rate control (RLoxygen), we studied its impact on mortality in COVID-19 patients with respiratory failure. The evolution of patients' ICU histories, including treatments, vitals, and health outcomes, was modeled using a Markov decision process (MDP) [14,16]. At each decision epoch, based on the state (observed patient characteristics, including age, sex, race, smoking status, BMI, and comorbidity diagnoses, 36 daily observed lab test values, and 6 unique vitals), RL selected an oxygen flow rate (ranged from 0 to 60 L/ min) and obtained a reward defined based on patient's 7-day survival. Then, following the oxygen flow rate suggested by RL policy, an estimated mortality rate was predicted to compare with the mortality rate in actual practice.

Study design and participants
Our research team used a retrospective cohort of the New York University Langone Health (NYULH) EHR data on COVID-19 patients to derive and validate the RL algorithm. Eligible patients had positive COVID-19 PCR test and had oxygen therapy in hospital between March 1st 2020 and January 19th 2021. We excluded COVID-19 patients aged below 50 and not been hospitalized as the lacked consistent documentation of vital signs, treatments, and laboratory tests. This study was approved by the NYULH IRB and the data were de-identified to ensure anonymity.
For each patient, we had access to demographic data, including age, sex, race, ethnicity and smoking status, ICU admits and discharge information, in-hospital living status, comorbidities, treatments, and laboratory test data. The comorbidities, including hyperlipidemia, coronary artery disease, heart failure, hypertension, diabetes, asthma or chronic obstructive pulmonary, dementia and stroke, are defined based on the International Classification of Diseases (ICD)-10 diagnosis codes. To reduce the feature dimensionality, we selected 36 laboratory tests based on two criteria: (1) less than 28% missing values; and (2) COVID-19 related tests and vital signs. In specific, we explore the associations between laboratory tests and COVID-19 based on existing literature and clinical findings. For example, recent studies have shown that a reduced estimated glomerular filtration rate (eGFR), low platelet count, low serum calcium level, increased white blood cell count, Neutrophil-to-lymphocyte ratio (NLR), and red blood cell distribution width-coefficient of variation (RDW-CV) are related to high risk of severity and mortality in patients with COVID-19 [17][18][19][20][21]. Additionally, some research suggests well-controlled blood glucose is associated with the lower mortality in COVID-19 patients with Type-2 diabetes [22] and continuous renal potassium level has correlation of hypokalemia, which is common among patients with COVID-19 [23]. Arterial blood gas analysis, including pH, Oxyhemoglobin saturation (SaO 2 ), oxygen saturation (SpO 2 ), partial pressure of oxygen (PaO 2 ) and bicarbonate (HCO 3 ), is commonly used biomarkers measuring the severity of ARDS [24,25].
In this study, we employed leave-one-hospital-out validation to evaluate the model performance. The whole dataset was divided into 4 batches by the hospital and then we take one batch as validation set and the rest as training set in each simulation.

RL algorithm overview
We model patient health trajectory and the clinical decisions during a course of intensive care over a period of ICU stay by a Markov decision process (MDP) with state, action, and reward. The state of a patient includes the observed patient demographics, vital signs, and laboratory tests at each time t . The action refers to the oxygen flow rate. After a sequence of actions, the patient receives a reward if he/she survives in the next 7 days; otherwise, a penalty to death will be given. The cumulative return is defined as the discounted sum of all rewards of each patient received during the ICU stay. The intrinsic design of RL provides a powerful tool to handle sparse and timedelayed reward signals, which makes them well-suited to overcome the heterogeneity of patient responses to actions and the delayed indications of the efficacy of treatments [14]. The details of state, action, and reward are listed as follows.
• State s t : observed patient's characteristics at each time t with information, including demographics, COVID-19 lab tests, and vital signs. • Action a t : oxygen flow rate ranged from 0 L/min to 60 L/min. • Reward r t : the reward of an action at time t is measured by its associated ultimate health outcome given the patient's health state. Similar to [14], we used inhospital mortality as the system-defined penalty and reward. When a patient survived, a positive reward was released at the end of the patient's trajectory (i.e., a `reward' of +15); a negative reward (i.e., a `penalty of −15) was issued if the patient died. We find that such a reward function can propagate the final health outcome backward to each decision and intervention over the period so that RL can predict long-term effects and dynamically guide the optimal oxygen flow treatment. • Discount factor γ : determines how much the RL agents balance rewards in the distant future relative to those in the immediate future. It can take values between 0 and 1 [16]. After considering the ICU stay tends to be short and conducting side experiments, we chose a value of 0.99, which means that we put nearly as much importance on late deaths as opposed to early deaths for each recommended oxygen flow rate.
The schematic of the proposed scheme with EHR cohort is shown in Fig. 1. As shown in the bottom part of this diagram, the electronic health data were collected from New York University Langone Health (NYULH) by following the clinical guide. At each time, the oxygen flow rate decision, denoted by a , was chosen based on current health state, denoted by s , of the patient and then a new heath state s ′ was observed at the next measurement time. We record the tuple s, a, s ′ , r in the experience replay memory with the zero reward r = 0 . At the end of the treatment, a positive reward was recorded (i.e., a `reward' of + 15) if patient survived; and a negative reward (i.e., a `penalty of -15) was issued if the patient died. Then we applied deep deterministic policy gradient (DDPG), as shown at the top part of Fig. 1, to learn the optimal decision policy from the experience replay memory. DDPG, composed of actor and critic networks, takes historical samples s, a, s ′ , r from EHR data to concurrently learn a critic network (Q-function approximation) and an actor network (policy). The critic network, denoted by Q π (s, a|θ) , is a nonlinear function that approximates the Q value function of the action a (i.e., oxygen flow rate) given a patient's health state s . The actor network representing a policy, denoted by π φ (s) , proposes an action for each given state through the mapping or equation a = π φ (s) . The critic loss, defined by the mean squared TD-error (see Eq. (5) in Additional file 1), is used to improve the approximation of Q-function. Based on the expected Q-value computed by the critic network, we use the policy gradient method to optimize the actor network. In sum, DDPG learns a scoring rule (critic network) which evaluates the performance of a candidate policy, i.e., it returns an oxygen flow rate given a patient's health state, and then uses such a rule to improve the decision making policy (actor network) by optimizing the score. See more details in Reinforcement Learning Algorithms Section of Additional file 1.

Model evaluation
We evaluated the RL-recommended oxygen therapy by comparing its efficacy with the observed one on the cohort from each validation hospital. At each decision time, the RL algorithm recommends an oxygen flow rate for the patient. If the absolute difference of recommended and the observed oxygen flow rate is less than 10 L/min, we say that RL is "consistent" with the critical care physicians.
When RL is discrepant with the oxygen flow rate used by physicians, the efficacy of the RL-recommended oxygen therapy is not directly observed. The problem then becomes how to assess the health outcomes in the future after taking RL recommendations. For this reason, we predicted the outcome of the RL-recommended treatment using Cox proportional hazards model, a regression model commonly used for investigating the association between the survival probability of patients during a period and predictor variables of interest in medical observational studies [26,27]. In short, a patient was labeled as "alive" if he/she survived after a treatment within seven days; otherwise, labeled as "deceased". Then we fitted a Cox survival model with demographics, vital signs, and lab tests as predictors and evaluated the effect of decision using the leave-one-hospital-out validation.
To assess the performance of the survival models, we compared predicted and observed outcomes (7-day living status) using 4 metrics: similarity, accuracy, Chi-squared test, and concordance index. Overall, the cosine similarity between predicted and actual survival is greater than Q π (s, a) = E ∞ t=0 γ t r t |s t = s, a t = a, π 99.9%, and the concordance index is 0.83. Both metrics indicate that the predictive model can effectively estimate unobserved health outcomes. Moreover, the paired Chisquared test (p-value < 0.0001) shows no significant difference between true and predicted survival.

Results
Overall, 1362 patients in NYULH EHR samples had a PCR-based COVID-19 diagnosis between March 2020 and January 2021. The demographic and clinic characteristics summary of the analysis cohort is shown in Table 1 The performance of the RL-oxygen is summarized in Table 2. Overall, the RL-oxygen algorithm shows superior performance compared to the clinical practice of oxygen therapy for COVID-19 patients. The overall 7-day estimated mortality under Physician prescribed oxygen was 7.94% (95% CI: 7.41-8.47), while overall estimated mortality under RL-oxygen was 5.37% (95% CI: 4.94-5.80), showing a 2.57% (95% CI: 2.08-3.06) reduction (P < 0.001). In addition, Table 2 depicts the characteristics of oxygen flow rate following the recommendations from both RL-oxygen and physicians. On average, the overall RL-oxygen flow rate was 1.28 L/min (95% CI: 1.14-1.42) lower than the rate delivered to patients.
The efficacy of the RL prescriptive algorithm was consistently observed across age, gender, BMI, and comorbidity subgroups ( Table 2). Demographically speaking, compared to the observed efficacies in patients of age 75 and younger, COVID-19 patients of age older than 75 observed higher efficacies from RL-oxygen recommended therapy than physician's recommendations. For example, 7-day estimated mortality rate under RL-oxygen for patients of age older than 80 was 5.87% (95% CI: 4.67-7.07) lower than under physician's therapy. In contrast, the 7-day estimated mortality rate under RL-oxygen was 0.55% (95% CI: 0.39-0.71) lower than that under physicians' therapy for patients aged between 50 and 65. Table 2 also shows that the RL-oxygen tends to be more effective for patients with comorbidities. Especially for COVID-19 patients with Asthma or chronic obstructive pulmonary, Dementia and Stroke, RL-oxygen reduced the 7-day mortality by 5.69%, 5.11% and 3.8% respectively on average.
We further studied 7-days mortality when the actually administered oxygen flow rate differed from the oxygen flow rate suggested by the RL-oxygen in Fig. 2. It shows how the observed mortality changes with the flow rate difference between RL-oxygen and physicians. This phenomenon suggests that increasing differences between the RL-oxygen and the observed delivering oxygen were associated with increasing observed mortality rates in a rate-dependent fashion. When the difference is minimum, we obtain the lowest 7-day mortality rate of 1.7%. Another observation from Fig. 2A is that the mortality rate increases when the RL-oxygen flow rate is lower or higher than the one from physicians. It suggests that both the oxygen deficit (lower oxygen flow rate than RL-oxygen recommendation) and the oxygen excess are sup-optimal for patients' outcomes. We observed a trend that RL-oxygen was in general lower than what was prescribed by the physicians and might result in better outcomes under a lower flow rate. It suggests that oxygen flow rates prescribed by doctors tend to be excessively high for some patients.
Last, we observed that the RL-oxygen and physicians recommended consistent flow rates in most times; see Fig. 2B. The overall distribution of oxygen flow rates recommended by RL-oxygen and physicians are presented in Fig. 3. It depicts how many measurement times each oxygen flow rate was recommended by RL-oxygen and physicians. In twenty-nine percent of the time, the patients received an oxygen flow close to the suggested rate within 5 L/min while forty-four percent of the time, the difference between the administered and suggested oxygen flow rates are within 10 L/min. Since the Table 2 Subgroup comparison of 7-day estimated mortality obtained using RL-oxygen algorithm and critical care physician decision guidance Categorical variables are summarized with frequencies (percentages) unless otherwise indicated. Continuous variables are summarized as the mean (standard error) of biomarkers * Variables indicate RL-oxygen is significantly different from physicians (p-value < 0.001)

Subgroups
Estimated mortality (%) Average oxygen (L/min) high-flow nasal oxygen (HFNO) therapy often increases flow rate in increments of 10 L/min up to 60 L/min [28], it suggests that RL-oxygen is consistent with physicians about 40-50% of the time.

Discussion
We used a RL approach to learn an optimal policy to continuously control the oxygen device for critically ill patients with COVID-19 who require oxygen therapy. As most people who become seriously unwell with COVID-19 have an acute respiratory illness [29,30], our algorithm has strong potential to improve individual health outcomes and reduce the COVID-19 mortality rate caused by respiratory failure. We designed the reward as the ultimate health outcome which is used to assess the performance of oxygen flow decisions along the treatment trajectory. As such, the reinforcement learning approach took uncertain outcomes and long-term treatment effects into consideration and made it smarter in understanding the long impact of an early decision on the final outcomes.
Our analysis suggests the current practice remains some potential to be improved as actual oxygen flow rate administered by intensivists showed more than fifty percent discrepancy with RL-oxygen recommendations. Importantly, we observe that RL-oxygen tends to prescribe lower oxygen flow rate than physician's prescribed rates but leads to better outcomes. This finding is especially important in the context of the ongoing and persistent medical oxygen shortages in some developing regions. As COVID-19 patient-care protocols have evolved, medical-grade oxygen is still considered as an essential resource to treatments for critically ill patients. In regions such as Africa, Middle East, and Asia, the surge in demand for medical oxygen to treat COVID-19 exacerbates preexisting gaps in medical-oxygen supplies, leading to substantial supply shortages.
Our analysis also identified some clinical patterns that RL-oxygen particularly works well. For example, patients with high risk (i.e., of age older than 75) observed higher efficacies than patients aged between 50 and 75 by using relatively lower averaged oxygen flow rate than actually administered. RL-oxygen also recommends a higher averaged oxygen flow rate may improve the health outcomes for patients aged from 50 to 65. Moreover, we also notice significant therapeutic discrepancies in patients with stroke and diabetes comorbidities. In both cases, RL-oxygen recommended higher averaged oxygen flow rate than doctors while showing a significant reduction in estimated mortality. In fact, these findings agree with recent studies which reported that "stroke survivors who underwent COVID-19 developed more acute respiratory distress syndrome and received more noninvasive mechanical ventilation" [31] and "diabetic patients required more oxygen therapy (60% vs. 26.9%)" [32].
Although our evaluation methodology controls for several confounding factors and shows high validation accuracy, sample scarcity and a large proportion of missing Fig. 2 A Comparison of the estimated 7-days mortality rates (y-axis) varying with the difference between the oxygen flow rate recommended by the RL optimal policy and that administered by doctors (x-axis) averaged over all time points per patient. The shaded area represents the 95% confidence interval. The smallest oxygen difference is mainly associated with the lowest 7-days mortality rates. The further away the dose received was from the suggested oxygen flow rate, the worse the outcome. B The histogram of oxygen flow rate difference between RL-oxygen and physicians (labels on the vertical axis)